Closed Bug 1673069 Opened 5 years ago Closed 4 years ago

adjustments to automation scripts to get tests running on OSX/aarch64

Categories

(Testing :: General, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: bhearsum)

Attachments

(4 files, 1 obsolete file)

I have been hacking around on a DTK machine from MacStadium. This is via VNC and SSH, not Taskcluster.

As part of this work, I ended up with a local script that I would run:

#!/bin/bash

export MOZ_DISABLE_STACK_FIX=0
export MOZ_AUTOMATION=1
export MOZ_FETCHES_DIR=/Users/administrator/cltbld/tasks/task_318/fetches
export GECKO_HEAD_REVISION=3f44c2ba673514ffea7c53a1c39b3f42d42701d4
export GECKO_HEAD_REPOSITORY=https://hg.mozilla.org/try
export MOZ_FETCHES='[{"artifact": "public/build/fix-stacks.tar.xz", "extract": true, "task": "bKKmiaPBT-WZY-joM5CPvg"}, {"artifact": "public/build/minidump_stackwalk.tar.xz", "extract": true, "task": "ESNQVBMARi-8cMOUzWhkpQ"}]'
export TASKCLUSTER_ROOT_URL=https://firefox-ci-tc.services.mozilla.com
export MOZHARNESS_TEST_PATHS='{"mochitest-browser-chrome": ["accessible/tests/browser/general/browser.ini"]}'
export EXTRA_MOZHARNESS_CONFIG='{"installer_url": "https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dazOC-QLRxaQcTXp6mYbpw/artifacts/public/build/target.dmg", "test_packages_url": "https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/dazOC-QLRxaQcTXp6mYbpw/artifacts/public/build/target.test_packages.json"}'
/usr/bin/python3 run-task -- /System/Library/Frameworks/Python.framework/Versions/2.7/bin/python2.7 -u mozharness/scripts/desktop_unittest.py --cfg mozharness/configs/unittests/mac_unittest.py --mochitest-suite=mochitest-plain '--setpref=media.peerconnection.mtransport_process=false' '--setpref=network.process.enabled=false' '--setpref=security.disallow_non_local_systemprincipal_in_tests=false'

Editing EXTRA_MOZHARNESS_CONFIG lets me point the run at a different build/test package.
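For example, to retarget the script at a different try build, only the task ID inside those two artifact URLs needs to change (the task ID below is a placeholder):

# BUILD_TASK is a placeholder -- use the build task ID from the try push you want to test.
BUILD_TASK=abc123DEFxyz
export EXTRA_MOZHARNESS_CONFIG="{\"installer_url\": \"https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/${BUILD_TASK}/artifacts/public/build/target.dmg\", \"test_packages_url\": \"https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/${BUILD_TASK}/artifacts/public/build/target.test_packages.json\"}"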

I have had to make some changes to the browser and scripts to make this work as well.

I have run what I think are all of the mochitest and browser-chrome tests; here is the log file:
https://drive.google.com/file/d/1lvPKfqOnOPnEeUJzZLysLnKzMnARRjvU/view?usp=sharing

Keep in mind this is using a browser built for Intel and run under emulation. Not the ideal end state, but it is what users would get today without an aarch64-specific build.

Here is the list of test failures (a pile of these are crashes):

[task 2020-10-23T12:23:28.724Z] 12:23:28    ERROR - TEST-UNEXPECTED-FAIL | dom/base/test/test_audioNotification.html | application terminated with exit code -11
[task 2020-10-23T12:28:42.196Z] 12:28:42    ERROR - TEST-UNEXPECTED-FAIL | dom/canvas/test/test_imagebitmap.html | application terminated with exit code -11
[task 2020-10-23T12:37:40.001Z] 12:37:40    ERROR - TEST-UNEXPECTED-FAIL | dom/file/tests/test_mozfiledataurl.html | application terminated with exit code -11
[task 2020-10-23T12:42:56.485Z] 12:42:56    ERROR - TEST-UNEXPECTED-FAIL | dom/html/test/test_fullscreen-api.html | application terminated with exit code -11
[task 2020-10-23T12:49:03.418Z] 12:49:03    ERROR - TEST-UNEXPECTED-FAIL | dom/plugins/test/mochitest/test_NPNVdocumentOrigin.html | application terminated with exit code -11
[task 2020-10-23T12:57:12.288Z] 12:57:12    ERROR - TEST-UNEXPECTED-FAIL | dom/security/test/mixedcontentblocker/test_main.html | application terminated with exit code -11
[task 2020-10-23T13:45:07.346Z] 13:45:07     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest.html | helper_hittest_pointerevents_svg.html | top right of scroller in testcase 3 hit info - got "VISIBLE | IRREGULAR_AREA | SCROLLBAR | SCROLLBAR_THUMB | SCROLLBAR_VERTICAL", expected "VISIBLE"
[task 2020-10-23T13:45:07.348Z] 13:45:07     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest.html | helper_hittest_pointerevents_svg.html | bottom right of scroller in testcase 3 hit info - got "VISIBLE | SCROLLBAR | SCROLLBAR_VERTICAL", expected "VISIBLE"
[task 2020-10-23T13:45:07.357Z] 13:45:07     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest.html | helper_hittest_pointerevents_svg.html | top right of scroller in testcase 4 hit info - got "VISIBLE | IRREGULAR_AREA | SCROLLBAR | SCROLLBAR_THUMB | SCROLLBAR_VERTICAL", expected "VISIBLE"
[task 2020-10-23T13:45:07.359Z] 13:45:07     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_hittest.html | helper_hittest_pointerevents_svg.html | bottom right of scroller in testcase 4 hit info - got "VISIBLE | SCROLLBAR | SCROLLBAR_VERTICAL", expected "VISIBLE"
[task 2020-10-23T13:46:15.293Z] 13:46:15     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_mouseevents.html | helper_bug1346632.html | No scrollbars found, cannot run this test!
[task 2020-10-23T13:46:54.196Z] 13:46:54     INFO - TEST-UNEXPECTED-FAIL | gfx/layers/apz/test/mochitest/test_group_wheelevents.html | helper_scroll_over_scrollbar.html | No scrollbars found, cannot run this test!
[task 2020-10-23T13:53:16.320Z] 13:53:16    ERROR - TEST-UNEXPECTED-FAIL | layout/base/tests/test_bug629838.html | application terminated with exit code -11
[task 2020-10-23T13:54:49.729Z] 13:54:49    ERROR - TEST-UNEXPECTED-FAIL | layout/generic/test/test_invalidate_during_plugin_paint.html | application terminated with exit code -11
[task 2020-10-23T14:18:00.349Z] 14:18:00     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html | Test timed out.
[task 2020-10-23T14:18:01.365Z] 14:18:01     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html | Extension left running at test shutdown
[task 2020-10-23T14:18:01.567Z] 14:18:01     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html | message queue is empty - got "[\"success\"]", expected "[]"
[task 2020-10-23T14:18:01.567Z] 14:18:01     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html | no tasks awaiting on messages - got "[\"error\"]", expected "[]"
[task 2020-10-23T14:18:01.567Z] 14:18:01     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html | Extension left running at test shutdown
[task 2020-10-23T14:18:06.748Z] 14:18:06    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html logged result after SimpleTest.finish(): Extension left running at test shutdown
[task 2020-10-23T14:18:06.748Z] 14:18:06    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html logged result after SimpleTest.finish(): message queue is empty
[task 2020-10-23T14:18:06.748Z] 14:18:06    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html logged result after SimpleTest.finish(): no tasks awaiting on messages
[task 2020-10-23T14:18:06.748Z] 14:18:06    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_contentscript_canvas.html logged result after SimpleTest.finish(): Extension left running at test shutdown
[task 2020-10-23T14:18:44.062Z] 14:18:44     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tabId returned for store - Expected: 1, Actual: 2
[task 2020-10-23T14:18:44.871Z] 14:18:44     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tab returned for store - Expected: 1, Actual: 2
[task 2020-10-23T14:18:44.873Z] 14:18:44     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tab returned for private store - Expected: 1, Actual: 2
[task 2020-10-23T14:18:58.791Z] 14:18:58     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tabId returned for store - Expected: 1, Actual: 2
[task 2020-10-23T14:18:59.570Z] 14:18:59     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tab returned for store - Expected: 1, Actual: 2
[task 2020-10-23T14:18:59.573Z] 14:18:59     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_cookies.html | one tab returned for private store - Expected: 1, Actual: 2
[task 2020-10-23T14:33:07.050Z] 14:33:07     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | Test timed out.
[task 2020-10-23T14:33:08.064Z] 14:33:08     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | Extension left running at test shutdown
[task 2020-10-23T14:33:08.103Z] 14:33:08     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | no tasks awaiting on messages - got "[\"done\"]", expected "[]"
[task 2020-10-23T14:33:14.310Z] 14:33:14    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html logged result after SimpleTest.finish(): Extension left running at test shutdown
[task 2020-10-23T14:33:14.310Z] 14:33:14    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html logged result after SimpleTest.finish(): no tasks awaiting on messages
[task 2020-10-23T14:47:24.885Z] 14:47:24     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | Test timed out.
[task 2020-10-23T14:47:25.885Z] 14:47:25     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | Extension left running at test shutdown
[task 2020-10-23T14:47:25.928Z] 14:47:25     INFO - TEST-UNEXPECTED-FAIL | toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html | no tasks awaiting on messages - got "[\"done\"]", expected "[]"
[task 2020-10-23T14:47:32.168Z] 14:47:32    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html logged result after SimpleTest.finish(): Extension left running at test shutdown
[task 2020-10-23T14:47:32.168Z] 14:47:32    ERROR - TEST-UNEXPECTED-FAIL | /tests/toolkit/components/extensions/test/mochitest/test_ext_webrequest_basic.html logged result after SimpleTest.finish(): no tasks awaiting on messages
[task 2020-10-23T14:59:39.833Z] 14:59:39    ERROR - TEST-UNEXPECTED-FAIL | toolkit/content/tests/widgets/test_bug898940.html | application terminated with exit code -11

It was recommended to enable scrollbars by default to solve some of this; I have done that on my MacStadium DTK, but need to run the tests again.
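If the recommendation was the system-wide macOS scrollbar setting, one way to apply it (a guess at the exact mechanism; the thread doesn't record how it was changed) is:

defaults write NSGlobalDomain AppleShowScrollBars -string Always

Running applications generally need to be restarted before they pick up the change.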

hacks to get mochitest running on apple osx DTK machines, currently WIP

I've been helping Joel get generic-worker running on his DTK machine. Here are a few notes from that:

  • Working configs are available in /etc/generic-worker on that machine. The non-secret parts that are notable are:

  workerPoolID: releng-hardware/gecko-t-osx-dtk-dev
  workerGroup: gecko-t
  workerID: osx-dtk-dev-1
  workerLocation:
    host: macstadium

  • The worker was registered in Worker Manager with:

taskcluster-linux-amd64 api workerManager createWorker releng-hardware/gecko-t-osx-dtk-dev gecko-t osx-dtk-dev-1 < ~/tmp/foo.json

foo.json contained:

{
    "expires": "2021-10-24T18:18:19.315Z",
    "capacity": 1,
    "providerInfo": {
        "staticSecret": "XXXXXXXXXXXXX"
    }
}

...where the secret is the same one configured in /etc/generic-worker/runner.yml on the worker.
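For reference, a very rough sketch of how those pieces map into worker-runner's runner.yml for the static provider (field names are from memory and not verified against the schema; the real file on the machine is authoritative):

provider:
    providerType: static
    rootURL: https://firefox-ci-tc.services.mozilla.com
    workerPoolID: releng-hardware/gecko-t-osx-dtk-dev
    workerGroup: gecko-t
    workerID: osx-dtk-dev-1
    staticSecret: XXXXXXXXXXXXX
worker:
    implementation: generic-worker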

We're now at a point where the worker runs and will claim tasks like https://firefox-ci-tc.services.mozilla.com/tasks/WR7Iv7kLS8ijcNvaS1ygUg, but they always fail for reasons I'm not entirely certain about -- probably something to do with how we set up generic-worker. I think it probably makes the most sense to switch it to single build mode, but I haven't been able to do that successfully.

Pete, do you have any ideas here? We can send you the credentials for this machine if you want to poke around.

Flags: needinfo?(pmoore)

I managed to get the generic-worker part working by switching to the single mode binary and running it as root (there were permission issues when running as genericworker).

Now I'm hitting some task-specific issues such as:

Executing command 1: /usr/local/bin/python3 run-task -- /usr/local/bin/python2 -u 'mozharness/scripts/desktop_unittest.py' --cfg 'mozharness/configs/unittests/mac_unittest.py' --mochitest-suite=mochitest-plain '--setpref=media.peerconnection.mtransport_process=false' '--setpref=network.process.enabled=false' --download-symbols true
[taskcluster 2020-10-26T17:51:53.842Z] System error executing command: fork/exec /usr/local/bin/python3: no such file or directory

...but I should be able to work through these myself.
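For anyone hitting the same fork/exec error: the task invokes interpreters at fixed paths, so a quick check of what actually exists there narrows it down (installing the python.org builds, which put binaries under /usr/local/bin, appears to be what was done later in this bug):

# Do the interpreters the task expects actually exist at those paths?
ls -l /usr/local/bin/python3 /usr/local/bin/python2
# Where do python2/python3 really live on this machine?
which -a python3 python2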

Flags: needinfo?(pmoore)

I've tried both the simple and multi generic worker engines, which end up failing for different reasons.
With simple, I first tried running it as genericworker, which failed with:

2020/10/26 09:44:36 mkdir /Users/task_160373067645543: permission denied

Running simple as root (after installing python2 + python3) ran the task OK, but had this issue inside of the task:

could not find user worker; specify a valid user with --user

That error comes from run-task, and because a user named worker doesn't exist on these machines, it makes me think that the simple engine isn't known to work here at all.

Finally, the multi engine running as root hits:

      "description": "Interactive username task_160348627139454 does not match task user task_160373691398454 from next-task-user.json file",

I ran into these same problems today also.

For the "simple", single-user engine, I found it works when run under an auto-login GUI user (this is how current production tests are still running, with the pre-multiuser generic-worker).

For the multi-user problem, I've seen that same mismatch happen on the few multiuser workers we run. I think that comes from generic-worker's tracking files being wrong -- you can delete them and try to restart, or update the json files with the right username. I don't remember the cause from when I've seen it, but I think it was something to do with generic-worker crashing.
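A rough sketch of that reset, assuming the state files live in the generic-worker working directory shown later in this bug (stop the worker first; the dscl delete is only needed if you also want to clear out leftover task users):

cd /var/generic-worker
sudo rm -f current-task-user.json next-task-user.json
# Optionally remove stale task users and their home directories.
# task_XXXXXXXXXXXXXXX is a placeholder for an actual user from the list below.
dscl . list /Users | grep '^task_'
sudo dscl . -delete /Users/task_XXXXXXXXXXXXXXX
sudo rm -rf /Users/task_XXXXXXXXXXXXXXX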

The error about the wrong task user being logged in would happen if generic worker expected one task user to be logged in, but discovered at runtime that a different one was logged in. It configures an automatic login for a given task user, and then validates when the login occurs, that indeed that expected user is the one that has logged in.

One possible cause is an interactive VNC login while generic-worker is running, which interferes with the console user, but there may be another issue at play. If it happens again, could you capture the generic-worker log file(s)? This shouldn't happen, but I have experienced it before when I've needed to VNC onto a machine and make changes. In that case the best course of action is, as Dave suggested, to delete the generic-worker state files; generic-worker will then consider it the very first run, create a new task user, and reboot into it.

I would recommend using the multiuser engine rather than the simple engine: it should be much more secure and operate like the Windows and Linux multiuser engine workers, creating a unique OS user for each task. That provides better isolation between tasks than the simple engine can offer, where all tasks run as the same user and can potentially interfere with future tasks by soiling the user account.

Any questions let me know. I'm also happy to jump onto Zoom and do a screen share if it helps to look at a problem together...

Thanks for the helpful comments! I switched back to the multi engine and got past the aforementioned issues. Now I'm hitting a new one:

[taskcluster:error] [mounts] [mounts] Not able to make directory /Users/task_160382347844829 writable for task_160382347844829: exit status 1

I tried clobbering all state (/var/generic-worker/, deleting task_ users + their homedirs) and starting from scratch - which didn't help. After initially starting the worker, it registered OK and the state was as follows:

administrator@28135 generic-worker % sudo cat current-task-user.json 
{
  "name": "task_160382347844829",
  "password": "redacted"
}
administrator@28135 generic-worker % sudo cat next-task-user.json 
{
  "name": "task_160382354105355",
  "password": "redacted"
}
administrator@28135 generic-worker % sudo cat directory-caches.json 
{}
administrator@28135 generic-worker % sudo cat file-caches.json 
{}
administrator@28135 generic-worker % sudo cat tasks-resolved-count.txt 
0%                                         

I ended up with two users as well:

28135:~ root# dscl . list /Users | grep task
task_160382347844829
task_160382354105355

After it took a task, state was as follows:

administrator@28135 generic-worker % sudo cat current-task-user.json 
Password:
{
  "name": "task_160382354105355",
  "password": "redacted"
}
administrator@28135 generic-worker % sudo cat directory-caches.json 
{}
administrator@28135 generic-worker % sudo cat file-caches.json 
{
  "artifact:X5XjPiC6TC2tpFef1H11tg:public/build/mozharness.zip": {
    "created": "2020-10-27T11:38:13.344073-07:00",
    "location": "/Library/Caches/generic-worker/downloads/Q79hppZ7Q9iDzmieEBw5QA",
    "hits": 1,
    "key": "artifact:X5XjPiC6TC2tpFef1H11tg:public/build/mozharness.zip",
    "sha256": "e85f8411f38e2b574604a937e5c5805429d672913d1086daf194104b679f75f4"
  }
}
administrator@28135 generic-worker % sudo cat next-task-user.json 
{
  "name": "task_160382395355955",
  "password": "redacted"
}
administrator@28135 generic-worker % sudo cat tasks-resolved-count.txt 
1%

And for users, I have:

28135:~ root# dscl . list /Users | grep task
task_160382354105355
task_160382395355955

The obvious thing that stands out here is that the username in the log (task_160382347844829) is neither of the ones listed in the user list after the task ran; it only appears in the user list between the worker registering and the task even being created. Maybe this is expected, if generic-worker is pre-creating the user for the next run -- but it's the only notable thing I can find in the log or state.

I think the next best step is for us to do a screen share - do you want to ping me when you're online, and we can hop on zoom?

The error Not able to make directory /Users/task_160382347844829 writable for task_160382347844829: exit status 1 suggests a problem running /usr/sbin/chown -R task_160382347844829:staff /Users/task_160382347844829 (see makeFileOrDirReadWritableForUser). Since task_160382347844829 is not listed as a user by dscl, that would explain why the command failed, although I'm not sure why it would try to use that user. Does the home directory /Users/task_160382347844829 exist?

In addition to cleaning up the generic-worker state files, it might be best to aggressively remove any task directories /Users/task_* before restarting generic-worker. But in any case it is strange, so I would like to do a screen share together to see if we can get to the bottom of it, so that hopefully other users don't stumble on the same strange issue. Thanks Ben!
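A quick way to answer both questions from a shell on the worker, using the same tools already shown in this bug:

# Is the task user known to Directory Services?
dscl . list /Users | grep task_160382347844829
# Does the home directory exist, and who owns it?
ls -ld /Users/task_160382347844829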

I think I've discovered the source of the issue. The task being submitted extracts a zip file to the directory .:

    "mounts": [
      {
        "format": "zip",
        "content": {
          "taskId": "X5XjPiC6TC2tpFef1H11tg",
          "artifact": "public/build/mozharness.zip"
        },
        "directory": "."
      },

The problem is that generic-worker aggressively sets recursive directory permissions for the mounted folder, and since the directory . is the home directory of the task user, it tries to change the ownership of Desktop and of directories under Library, which upsets Big Sur (presumably this was not a problem on Catalina).

2020/10/28 08:09:44 Running command: /usr/sbin/chown -R task_160389635555599:staff /Users/task_160389635555599
2020/10/28 08:09:45 Error running command:
2020/10/28 08:09:45 chown: /Users/task_160389635555599/Desktop: Operation not permitted
chown: /Users/task_160389635555599/Desktop: Operation not permitted
chown: /Users/task_160389635555599/Library/Application Support/CallHistoryTransactions: Operation not permitted
chown: /Users/task_160389635555599/Library/Application Support/CallHistoryTransactions: Operation not permitted
chown: /Users/task_160389635555599/Library/Application Support/com.apple.sharedfilelist: Operation not permitted
.....
.....

There are a few different options we could use to solve this issue:

  1. The task could be adapted to extract the mozharness zip file to a subdirectory, and then the paths updated.
  2. I think the mozharness zip already contains a top-level mozharness directory, so perhaps the zip could be changed to not include that top-level directory; the paths would then remain the same, and the directory to unzip to could be changed to mozharness.
  3. We could change the mounts feature to only try to apply permissions changes to directories, if those directories do not exist already.

I suspect option 3) is the correct approach in any case, and it is a pretty easy fix for generic-worker, but 1) or 2) would be reasonable workarounds in the meantime (see comment 11).

We didn't see this issue in our testing, as we don't have tasks that extract archives to ".".
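For illustration, option 1 would only need the mount's directory changed from "." to a subdirectory, roughly like this (the directory name is arbitrary, and the task's command paths would need updating to match):

    "mounts": [
      {
        "format": "zip",
        "content": {
          "taskId": "X5XjPiC6TC2tpFef1H11tg",
          "artifact": "public/build/mozharness.zip"
        },
        "directory": "mozharness"
      },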

So looking a little closer, the reason the file permissions are changed is that the zip is extracted by generic-worker, which runs as root; the subsequent permission change makes sure whatever was extracted is owned by the task user. Since it doesn't know which files were contained inside the zip file, it recursively sets the permissions of everything in the directory that the zip file was extracted to.

Instead we should extract mounted archives as the task user directly, to be sure they don't overwrite anything they shouldn't have permission to overwrite. Not only would this be safer, but it would also avoid a two step process of extracting then changing permissions, since the extraction would be a single step. It would also help protect against extraction vulnerabilities such as zip-slip.
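Conceptually that looks something like the following (generic-worker would do this internally rather than shelling out; the task user and cached download path here are the ones from the log above):

# Extract with the task user's identity instead of as root, so the files are
# owned correctly from the start and can't overwrite anything the task user
# couldn't already write to.
sudo -u task_160389635555599 /usr/bin/unzip -o /Library/Caches/generic-worker/downloads/Q79hppZ7Q9iDzmieEBw5QA -d /Users/task_160389635555599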

So I propose adopting either short-term fix 1 or 2 from comment 10 above, or even a new option:

  4. Extract to a subdirectory "mozharness" and add an additional command to the task, something like: mv mozharness/* ..

Option 4 is similar to option 1, but helps if there are hardcoded references in other places that make option 1 infeasible.

We may already have a bug on file for extracting archives as the task user in multiuser engine - I'll check. We should bump the priority on it. When that is implemented, this issue would disappear, since there would no longer be permission changes made after extracting archives; the files would already have the correct permissions on extraction.

(In reply to Pete Moore [:pmoore][:pete] from comment #11)

We may already have a bug on file for extracting archives as the task user in multiuser engine - I'll check.

I created bug 1673922 for this.

(In reply to Pete Moore [:pmoore][:pete] from comment #10)

There are a few different options we could use to solve this issue:

  1. The task could be adapted to extract the mozharness zip file to a subdirectory, and then the paths updated.
  2. I think the mozharness zip already contains a top level mozharness directory inside the zip file, so perhaps the zip file could be changed not to include the top level mozharness directory, so that paths remain the same, but the directory to unzip to could be changed to mozharness

Either option here sounds easy enough and is something I can help out with. I've been looking for a way to be useful on this project anyway. Sort of sounds like we should avoid extracting things in the home directory regardless of bug 1673922. I don't think updating the paths will be too difficult (and it is easily tested on try). I'll file a new bug.

Depends on: 1673992

Big thanks to Pete for helping sort out the issues with the multi engine on generic worker. For the moment, I've gone back to the simple engine, because I discovered that it's what we've already been using on the Intel Big Sur machines. I do think we should switch back to multi as soon as we can though.

The aforementioned issues on simple (the chown /Users/task_* errors) were fixed by setting tasksDir to something else that doesn't have the same restrictions (I used /var/genericworker-tasks, but it can be almost anything).
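In the generic-worker config that is a single property; a minimal fragment (all other required settings omitted, and the directory itself created ahead of time):

{
    "tasksDir": "/var/genericworker-tasks"
}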

After that, tasks actually started to run! The next issue I hit was errors with Python SSL, which were solved by running:

"/Applications/Python 3.8/Install Certificates.command"
/Library/Frameworks/Python.framework/Versions/2.7/Resources/Python.app/Contents/MacOS/Python -E -s -m easy_install pip
"/Applications/Python 2.7/Install Certificates.command"

...to install the certificates.

However, even with that done, I'm still getting failures like:

[task 2020-10-28T20:14:45.352Z] 20:14:45     INFO - retry: attempt #5 caught URLError exception: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed (_ssl.c:727)>
[task 2020-10-28T20:14:45.352Z] 20:14:45    FATAL - Can't download from https://firefox-ci-tc.services.mozilla.com/api/queue/v1/task/X5XjPiC6TC2tpFef1H11tg/artifacts/public/build/target.test_packages.json to /private/var/genericworker-tasks/task_160391531113808/build/target.test_packages.json!

Curiously, if I run mozharness by hand it seems to work OK, which makes me think there's something in the task config or run-task influencing it - but I don't see anything that could be causing it.
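One way to compare the two environments is to dump what each interpreter thinks its certificate store is, both from an interactive shell and from inside a task:

# Print the default certificate paths each interpreter will use, plus any
# SSL/cert-related environment variables that might differ inside the task.
/usr/local/bin/python2 -c "import ssl; print(ssl.get_default_verify_paths())"
/usr/local/bin/python3 -c "import ssl; print(ssl.get_default_verify_paths())"
env | grep -i -E 'ssl|cert|ca_bundle'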

(In reply to Andrew Halberstadt [:ahal] from comment #13)

Either option here sounds easy enough and something I can help out with. I've been looking for a way to be useful on this project anyway.. Sort of sounds like we should avoid extracting things in the home directory regardless of bug 1673922. I don't think updating the paths will be too difficult (and is easily tested on try). I'll file a new bug.

Thanks Andrew!

Ben, I realised another temporary solution is simply specifying a different value for tasksDir under multiuser engine. By default it is /Users but you could change it to e.g. /tasks or /var/tasks or something similar that Big Sur allows. The home directory and the task directory are both purged after a task run - if they happen to be the same directory, that is ok, but they don't have to be. For example on Windows, the home directory is typically on C:\ but the task directories are on Z:\ so the multiuser engine is already designed to clean up both after a task. As soon as the task directory moves to a different location, the home directory system files and directories (such as Library, Desktop, ....) will no longer be in the task directory, so generic-worker won't try to modify their permissions, and the problem goes away. Fingers crossed, with this small change, things should just simply work!

Note, we should still implement bug 1673922 and bug 1673992, but changing tasksDir should at least unblock using multiuser engine.

Depends on: 1674203
No longer depends on: 1673992
No longer depends on: 1674203

(In reply to Pete Moore [:pmoore][:pete] from comment #16)

Ben, I realised another temporary solution is simply specifying a different value for tasksDir under multiuser engine. By default it is /Users but you could change it to e.g. /tasks or /var/tasks or something similar that Big Sur allows. The home directory and the task directory are both purged after a task run - if they happen to be the same directory, that is ok, but they don't have to be. For example on Windows, the home directory is typically on C:\ but the task directories are on Z:\ so the multiuser engine is already designed to clean up both after a task. As soon as the task directory moves to a different location, the home directory system files and directories (such as Library, Desktop, ....) will no longer be in the task directory, so generic-worker won't try to modify their permissions, and the problem goes away. Fingers crossed, with this small change, things should just simply work!

Note, we should still implement bug 1673922 and bug 1673992, but changing tasksDir should at least unblock using multiuser engine.

Looks like we're able to use /Users as the tasksDir with ahal's patch: https://phabricator.services.mozilla.com/D95138 -- that should be landed on central sometime today, so I think we're unblocked here.

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #17)

Looks like we're able to use /Users as the tasksDir with ahal's patch: https://phabricator.services.mozilla.com/D95138 -- that should be landed on central sometime today, so I think we're unblocked here.

Great! In the meantime, I just coincidentally discovered https://github.com/taskcluster/taskcluster/commit/76d281c72691a3bed814f11bf9ad5767c47fa967#diff-dca1e9fc9cc768f225a74371a9fb518db060e3e3e872093d22c9f3632489cf3dR382-R389 from 8 months ago - and worst of all, it is my own comment, so I should have remembered!

Assignee: nobody → bhearsum
Status: NEW → ASSIGNED
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/a858c6c690cb add worker pool for dtk machines. r=releng-reviewers,mtabara

The one machine we have set up in the pool is largely OK now, but ci-admin is creating secrets for it that break it (specifically, https://firefox-ci-tc.services.mozilla.com/secrets/worker-pool%3Areleng-hardware%2Fgecko-t-osx-dtk-dev). I'm trying to sort that out, but in the meantime that secret may have to be deleted periodically, and the worker kicked (by unloading and reloading the launch plist) to keep it up and running.
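Kicking the worker looks roughly like this, assuming it runs as a LaunchDaemon (the plist path below is a placeholder; use whatever is actually installed on the machine):

# Placeholder plist path -- substitute the real generic-worker launch plist.
sudo launchctl unload /Library/LaunchDaemons/net.generic.worker.plist
sudo launchctl load /Library/LaunchDaemons/net.generic.worker.plist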

Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/4fd3cd9906f1 set implementation for dtk worker pool to avoid docker worker secrets being created. r=dhouse
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-configuration/rev/66da55814060 backout dtk worker pool implementation, because it's invalid
Pushed by bhearsum@mozilla.com: https://hg.mozilla.org/ci/ci-admin/rev/b585613d6b99 hack to avoid copying stateless secret into mac dtk worker pool

(In reply to bhearsum@mozilla.com (:bhearsum) from comment #21)

The one machine we have set up in the pool is largely OK now, but ci-admin is creating secrets for it that break it (specifically, https://firefox-ci-tc.services.mozilla.com/secrets/worker-pool%3Areleng-hardware%2Fgecko-t-osx-dtk-dev). I'm trying to sort that out, but in the meantime that secret may have to be deleted periodically, and the worker kicked (by unloading and reloading the launch plist) to keep it up and running.

I sorted out this issue with secrets.

Also, for the record, tests can be triggered on this machine by using the --worker-override setting of ./mach try, e.g.:
./mach try fuzzy --worker-override "t-osx-1014=releng-hardware/gecko-t-osx-dtk-dev"

Here's an example push: https://treeherder.mozilla.org/jobs?repo=try&tier=1&revision=3b8b55a36abcc912b94c5e59a64c1853e3c965da&searchStr=1014-64%2Fopt

nothing else to do here!

Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Attachment #9183511 - Attachment is obsolete: true